Correlation and Regression

Matthew Talluto

22.02.2022

Covariance

Two variables are independent when knowing \(x\) tells us nothing about \(y\):

\[ pr(y|x) = pr(y) \]

We can use the plot function to make a bivariate scatterplot in order to visualise this:

x = rnorm(1000)
y = rnorm(1000) ## these variables are independent

plot(x, y, pch=16, bty='n')
Independent random variables

Covariance

\[ \mathrm{cov}_{xy} = \frac{\sum_{i=1}^n \left (x_i - \bar{x} \right) \left(y_i - \bar{y} \right)}{n-1} \]
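We can check this formula directly against R's built-in `cov()` function; a minimal sketch with simulated data (the variable names and simulated values here are illustrative):

```r
set.seed(42)
x = rnorm(50)
y = 0.5 * x + rnorm(50)  ## y depends on x, so covariance is nonzero

## covariance by hand: sum of cross-products of deviations, divided by n - 1
cov_manual = sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)

## matches the built-in estimator
all.equal(cov_manual, cov(x, y))
```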

Correlation

\[ \begin{aligned} \mathrm{cov}_{xy} & = \frac{\sum_{i=1}^n \left (x_i - \bar{x} \right) \left(y_i - \bar{y} \right)}{n-1} \\ r_{xy} & = \frac{\mathrm{cov}_{xy}}{s_x s_y} \end{aligned} \]
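The correlation coefficient is just the covariance rescaled by the two standard deviations, which we can verify against `cor()` (again with simulated data for illustration):

```r
set.seed(1)
x = rnorm(100)
y = -0.3 * x + rnorm(100)

## r is the covariance divided by the product of the standard deviations
r_manual = cov(x, y) / (sd(x) * sd(y))

## matches the built-in function
all.equal(r_manual, cor(x, y))
```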

Correlation significance testing

\(H_0\): \(\rho = 0\)

\(H_A\): two sided (\(\rho \ne 0\)) or one-sided (\(\rho > 0\) or \(\rho < 0\))

\(r\) has a standard error:

\[ s_{r} = \sqrt{\frac{1-r^2}{n-2}} \]

We can then compute a \(t\)-statistic with \(n-2\) degrees of freedom:

\[ t = \frac{r}{s_r} \]

The p-value is the probability of observing a statistic at least as extreme as \(t\) under \(H_0\) (i.e., use the CDF of the \(t\) distribution with \(n-2\) degrees of freedom).

Correlation test in R

data(iris)
iris_set = subset(iris, Species == "setosa")
n = nrow(iris_set)
r = cor(iris_set$Sepal.Length, iris_set$Petal.Length)
r
## [1] 0.2671758
s_r = sqrt((1-r^2)/(n-2))
s_r
## [1] 0.1390906
t_val = r/s_r
t_val
## [1] 1.920876
2 * pt(t_val, n-2, lower.tail = FALSE) ## two-sided test
## [1] 0.06069778

Correlation test in R

## Equivalent built-in test
with(iris_set, cor.test(Sepal.Length, Petal.Length, alternative = "two.sided"))
## 
##  Pearson's product-moment correlation
## 
## data:  Sepal.Length and Petal.Length
## t = 1.9209, df = 48, p-value = 0.0607
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.01206954  0.50776233
## sample estimates:
##       cor 
## 0.2671758

Correlation test: assumptions

Correlation pitfalls


The linear model

\[ \begin{aligned} \mathbb{E}(y|x) & = \hat{y} = \alpha + \beta x \\ & \approx a + bx \end{aligned} \]

\(y\) is not perfectly predicted by \(x\), so we must include an error (residual) term:

\[ \begin{aligned} y_i & = \hat{y_i} + \epsilon_i \\ & = a + bx_i + \epsilon_i\\ \\ \epsilon & \sim \mathcal{N}(0, s_{\epsilon}) \end{aligned} \]

Method of least squares

\[ \begin{aligned} \hat{y_i} & = a + b x_i \\ y_i & = \hat{y_i} + \epsilon_i \\ \epsilon & \sim \mathcal{N}(0, s_{\epsilon}) \end{aligned} \]

We want to solve these equations for \(a\) and \(b\)

The “best” \(a\) and \(b\) are the ones that give the line closest to the points overall (i.e., that minimize \(s_{\epsilon}\))

Method of least squares

\[ s_{\epsilon} = \sqrt{\frac{\sum_{i=1}^n \left (y_i -\hat{y_i}\right )^2}{n-2}} \]

Minimizing \(s_{\epsilon}\) is equivalent to minimizing the error sum of squares (ESS):

\[ \begin{aligned} \mathrm{ESS} & = \sum_{i=1}^n \left (y_i -\hat{y_i}\right )^2 \\ & = \sum_{i=1}^n \left (y_i - a - bx_i \right )^2 \end{aligned} \]

Ordinary least squares estimation

Solving for the minimum ESS yields:

\[ \begin{aligned} b & = \frac{\mathrm{cov_{xy}}}{s^2_x} \\ \\ & = r_{xy}\frac{s_y}{s_x}\\ \\ a & = \bar{y} - b\bar{x} \end{aligned} \]

The parameters have standard errors:

\[ s_a = s_{\hat{y}}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum \left( x-\bar{x} \right)^2}} \]

\[ s_b = \frac{s_{\hat{y}}}{\sqrt{\sum \left( x-\bar{x} \right)^2}} \]
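These closed-form expressions can be checked against `lm()`; a minimal sketch with simulated data (the variable names and true parameter values are made up for illustration):

```r
set.seed(2)
x = rnorm(30)
y = 2 + 0.8 * x + rnorm(30, sd = 0.5)
n = length(x)

## closed-form OLS estimates
b = cov(x, y) / var(x)
a = mean(y) - b * mean(x)

## residual standard error (n - 2 degrees of freedom)
s_eps = sqrt(sum((y - (a + b * x))^2) / (n - 2))

## standard errors of intercept and slope
s_a = s_eps * sqrt(1 / n + mean(x)^2 / sum((x - mean(x))^2))
s_b = s_eps / sqrt(sum((x - mean(x))^2))

## compare with lm: same estimates, same standard errors
mod = lm(y ~ x)
coef(summary(mod))
c(a = a, b = b, s_a = s_a, s_b = s_b)
```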

Significance tests

\(\mathbf{H_0}\): The slope \(\beta\) = 0 (i.e., no variation in \(y\) is explained by variation in \(x\))

The ratio of variance explained to residual variance follows an \(F\) distribution

\[\begin{aligned} F & = \frac{MS_{model}}{MS_{err}} \\ MS_{model} & = \frac{\sum_{i=1}^n \left ( \hat{y}_i - \bar{y}\right)^2}{1} \\ MS_{err} & = \frac{\sum_{i=1}^n \left ( y_i - \hat{y}_i \right)^2}{n-2} \end{aligned}\]

(For simple regression the model has 1 degree of freedom and the residuals have \(n-2\).)

Goodness of fit

The coefficient of determination tells you the proportion of variance explained by the model:

\[ r^2 = \frac{\sum_{i=1}^n \left ( \hat{y_i} - \bar{y} \right )^2} {\sum_{i=1}^n \left ( y_i - \bar{y} \right )^2} \]
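Both the \(F\)-ratio and \(r^2\) can be computed by hand and compared with what `summary()` reports; a sketch with simulated data (names and true coefficients are illustrative):

```r
set.seed(3)
x = rnorm(40)
y = 1 + 0.6 * x + rnorm(40)
mod = lm(y ~ x)
n = length(y)
yhat = fitted(mod)

## F: model mean square (1 df for one predictor) over error mean square (n - 2 df)
ms_model = sum((yhat - mean(y))^2) / 1
ms_err = sum((y - yhat)^2) / (n - 2)
F_val = ms_model / ms_err

## coefficient of determination: proportion of total variance explained
r2 = sum((yhat - mean(y))^2) / sum((y - mean(y))^2)

c(F_val, summary(mod)$fstatistic[1])  ## these two should agree
c(r2, summary(mod)$r.squared)         ## and these two
```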

Regression assumptions

set.seed(123)
y = rpois(200, 7.5)
c(
  mean(log(y)),
  log(mean(y)))
## [1] 1.962674 2.024193
y = c(y, 0)
c(
  mean(log(y)),
  log(mean(y)))
## [1]     -Inf 2.019206

Transformations

Regression in R

## Data on sparrow wing lengths from Zar (1984)
dat = data.frame(age = c(3:6, 8:12, 14:17), 
        wing_length = c(1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0))
mod = lm(wing_length ~ age, data = dat)
summary(mod)
## 
## Call:
## lm(formula = wing_length ~ age, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.30699 -0.21538  0.06553  0.16324  0.22507 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.71309    0.14790   4.821 0.000535 ***
## age          0.27023    0.01349  20.027 5.27e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2184 on 11 degrees of freedom
## Multiple R-squared:  0.9733, Adjusted R-squared:  0.9709 
## F-statistic: 401.1 on 1 and 11 DF,  p-value: 5.267e-10
confint(mod)
##                 2.5 %    97.5 %
## (Intercept) 0.3875590 1.0386300
## age         0.2405308 0.2999272

Regression in R: diagnostics

par(mfrow=c(1, 2), bty='n')

## scale(residuals) produces **standardized** residuals
qqnorm(scale(residuals(mod)), pch=16)
qqline(scale(residuals(mod)), lty=2)
plot(dat$age, residuals(mod), xlab = "age", ylab = "residuals", pch=16)
abline(h = 0, lty=2)


## also try
## plot(mod)

Presenting regression output

We found a significant positive relationship between wing length and age (\(F_{1,11} = 400\), \(p < 0.001\), \(R^2 = 0.97\); Table 1)

Table 1. Parameter estimates for regression of wing length on age

            estimate   st. error   95% CI
intercept   0.71       0.150       0.39, 1.03
age         0.27       0.013       0.24, 0.30

Multiple regression

\[ \hat{y} = a + b_1x_1 + b_2x_2 \]

Multiple regression

\[ \hat{y} = \mathbf{X}\mathbf{B} \]

\(\mathbf{B}\) is the parameter vector

\(\mathbf{X}\) is the design matrix

\[ \mathbf{X} = \begin{bmatrix} 1 & x_{1,1} & x_{2,1} & \dots & x_{k,1} \\ 1 & x_{1,2} & x_{2,2} & \dots & x_{k,2} \\ \vdots & \vdots & \vdots & \dots & \vdots \\ 1 & x_{1,n} & x_{2,n} & \dots & x_{k,n} \\ \end{bmatrix} \]

Multiple regression

\[ \begin{aligned} \hat{y} & = \mathbf{X}\mathbf{B} \\ y &= \mathbf{X}\mathbf{B} + \epsilon \\ \epsilon & \sim \mathcal{N}(0, s_\epsilon) \end{aligned} \]

The equation is a linear system and can be solved with linear algebra by OLS, minimizing the sum of squared errors:

\[ \mathrm{min}: \sum \epsilon^2 = \sum \left (\mathbf{X}\mathbf{B} - y \right)^2 \]

Or in R we can just add terms to the formula ;-)

lm(y ~ x1+x2)
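A minimal sketch (with simulated `x1` and `x2`) confirms that solving the normal equations by hand reproduces the coefficients from lm:

```r
set.seed(4)
n = 100
x1 = rnorm(n)
x2 = rnorm(n)
y = 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

## design matrix: a column of 1s for the intercept, then the predictors
X = cbind(1, x1, x2)

## solve the normal equations (X'X) B = X'y
B = solve(t(X) %*% X, t(X) %*% y)

cbind(matrix_OLS = drop(B), lm = coef(lm(y ~ x1 + x2)))
```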

Hypotheses

  1. \(\mathbf{H_{0,regression}}\): The model (i.e., the linear system \(\mathbf{XB}\)) does not explain any of the variation in \(y\).
    • Test via ANOVA (F-ratio of explained to residual variance)
  2. \(\mathbf{H_{0,coef_i}}\): The slope of the relationship between \(y\) and \(x_i\) is zero.
    • Test with \(t\)-statistic (in output of lm) (or check if CI overlaps 0)
    • Alternatives: ANOVA (nested models), AIC (all models)

So which predictor is best?

  1. Consider dropping non-significant predictors
    • unless you know/strongly suspect they are important
    • \(x_1\) might be significant, but only when \(x_2\) is in the model!
    • interactions (more on this later) must always include the corresponding main effects (\(b_1x_1 + b_2x_2 + b_3x_1x_2\))
    • higher-order polynomials should always include the lower-order terms (\(b_1x+b_2x^2\))
  2. Compare slopes in a scale independent way
    • standardize your predictors (especially if \(s_{x_1} >> s_{x_2}\)), or
    • standardize the coefficients

\[ b_i^* = b_i \frac{s_{x_i}}{s_y} \]
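Standardizing the coefficients after fitting gives the same answer as refitting with standardized variables; a sketch with simulated predictors on deliberately different scales (names and values are illustrative):

```r
set.seed(5)
n = 100
x1 = rnorm(n, sd = 10)   ## predictors on very different scales
x2 = rnorm(n, sd = 0.1)
y = 0.05 * x1 + 5 * x2 + rnorm(n)
mod = lm(y ~ x1 + x2)

## standardize the raw slopes: b* = b * s_x / s_y
b_star = coef(mod)[-1] * c(sd(x1), sd(x2)) / sd(y)

## equivalently, refit after standardizing all variables
mod_z = lm(scale(y) ~ scale(x1) + scale(x2))
rbind(b_star = b_star, refit = coef(mod_z)[-1])
```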

Pitfalls: Model flexibility

lm(brain ~ mass, data = hominid)
lm(brain ~ mass + I(mass^2), data = hominid)
lm(brain ~ mass + I(mass^2) + I(mass^3), data = hominid)
## etc...
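The hominid data above is illustrative; a minimal sketch with simulated data shows why this flexibility is seductive: \(R^2\) never decreases as polynomial terms are added, even when the true relationship is linear.

```r
set.seed(7)
mass = runif(15, 30, 80)
brain = 100 + 10 * mass + rnorm(15, sd = 50)  ## truly linear relationship

## R^2 for polynomials of increasing degree: never goes down
r2 = sapply(1:4, function(k) summary(lm(brain ~ poly(mass, k)))$r.squared)
round(r2, 3)
```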

The curse of dimensionality

Multiple comparisons: n = 200

set.seed(1515)
n_x = 20
n_obs = n_x*10
y = rnorm(n_obs)  ## random y-variable

## 20 random x-variables, no relationship to y
x = matrix(rnorm(n_obs*n_x), ncol=n_x)
dat = data.frame(cbind(y, x))
colnames(dat) = c("y", paste0('x', 1:n_x))

## y ~ . means use every remaining variable in the data frame
mod = lm(y ~ ., data = dat) 
summary(mod)
## 
## Call:
## lm(formula = y ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.71527 -0.70863 -0.09133  0.78663  2.73273 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.0510218  0.0838196   0.609   0.5435  
## x1          -0.0992465  0.0832114  -1.193   0.2346  
## x2          -0.2188291  0.0850971  -2.572   0.0109 *
## x3           0.0100788  0.0828715   0.122   0.9033  
## x4          -0.0504836  0.0802768  -0.629   0.5302  
## x5           0.1498394  0.0897593   1.669   0.0968 .
## x6           0.0351077  0.0849153   0.413   0.6798  
## x7          -0.0366265  0.0750078  -0.488   0.6259  
## x8          -0.0007617  0.0832326  -0.009   0.9927  
## x9           0.0001057  0.0851478   0.001   0.9990  
## x10          0.0961346  0.0908632   1.058   0.2915  
## x11         -0.1038063  0.0855094  -1.214   0.2264  
## x12         -0.0371452  0.0825934  -0.450   0.6534  
## x13          0.0299666  0.0847690   0.354   0.7241  
## x14          0.0931749  0.0805042   1.157   0.2487  
## x15         -0.0174443  0.0828449  -0.211   0.8335  
## x16         -0.0980819  0.0869842  -1.128   0.2610  
## x17          0.0593543  0.0827010   0.718   0.4739  
## x18         -0.0296575  0.0768314  -0.386   0.6999  
## x19          0.0342252  0.0888166   0.385   0.7004  
## x20         -0.0040929  0.0917435  -0.045   0.9645  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.11 on 179 degrees of freedom
## Multiple R-squared:  0.08315,    Adjusted R-squared:  -0.0193 
## F-statistic: 0.8116 on 20 and 179 DF,  p-value: 0.6974

Multiple comparisons: n = 40

set.seed(1515)
n_x = 20
n_obs = n_x*2
y = rnorm(n_obs)  ## random y-variable

## 20 random x-variables, no relationship to y
x = matrix(rnorm(n_obs*n_x), ncol=n_x)
dat = data.frame(cbind(y, x))
colnames(dat) = c("y", paste0('x', 1:n_x))

## y ~ . means use every remaining variable in the data frame
mod = lm(y ~ ., data = dat) 
summary(mod)
## 
## Call:
## lm(formula = y ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.37931 -0.53334  0.00515  0.30438  1.80457 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.36919    0.23534   1.569   0.1332  
## x1          -0.37769    0.19326  -1.954   0.0655 .
## x2          -0.26935    0.20836  -1.293   0.2116  
## x3           0.01585    0.18513   0.086   0.9327  
## x4          -0.24950    0.18277  -1.365   0.1882  
## x5           0.11977    0.23935   0.500   0.6226  
## x6          -0.11829    0.22575  -0.524   0.6064  
## x7           0.32152    0.23386   1.375   0.1852  
## x8           0.28923    0.21415   1.351   0.1927  
## x9          -0.40926    0.22296  -1.836   0.0821 .
## x10         -0.67912    0.24546  -2.767   0.0123 *
## x11         -0.05407    0.22265  -0.243   0.8107  
## x12          0.06647    0.20304   0.327   0.7470  
## x13         -0.05534    0.23823  -0.232   0.8188  
## x14         -0.57639    0.25363  -2.273   0.0349 *
## x15         -0.08342    0.19241  -0.434   0.6695  
## x16         -0.08631    0.22833  -0.378   0.7096  
## x17          0.40355    0.20550   1.964   0.0644 .
## x18          0.60715    0.34957   1.737   0.0986 .
## x19         -0.27751    0.21418  -1.296   0.2106  
## x20         -0.16230    0.21580  -0.752   0.4612  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9787 on 19 degrees of freedom
## Multiple R-squared:  0.6081, Adjusted R-squared:  0.1956 
## F-statistic: 1.474 on 20 and 19 DF,  p-value: 0.201

Additional assumptions

Ignoring \(y\) for a moment, we can perform regressions of the \(x\) variables against each other:

\[ x_i = b_0 + b_1 x_1 + \dots + b_k x_k + \epsilon \quad \textrm{(excluding } x_i \textrm{ from the predictors)} \]

\[ \mathrm{VIF}_i = \frac{1}{1-R^2_i} \]

The VIF tells you how much collinearity inflates the standard error of \(b_i\): \(s_b = s_{b,\rho=0}\sqrt{\mathrm{VIF}}\), where \(s_{b,\rho=0}\) is the standard error that would apply if the predictors were uncorrelated.
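A sketch of the VIF computation by hand, with simulated predictors where `x2` is deliberately collinear with `x1` (names and values are illustrative):

```r
set.seed(6)
n = 100
x1 = rnorm(n)
x2 = 0.9 * x1 + rnorm(n, sd = 0.3)  ## x2 strongly collinear with x1
x3 = rnorm(n)

## VIF for x1: R^2 from regressing x1 on the other predictors
r2_x1 = summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 = 1 / (1 - r2_x1)
vif_x1  ## well above 1: the standard error of b_1 is inflated by sqrt(VIF)
```

The `vif()` function in the car package computes this for every predictor of a fitted model.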